Hamiltonian Monte Carlo and the No-U-Turn Sampler

Computational Statistics Project

Jonathan Fitter (jfitter@wu.ac.at)
Max Heinze (mheinze@wu.ac.at)

Department of Economics, Vienna University of Economics and Business (WU Vienna)

May 27, 2025
Intro and Recap

Hamiltonian Monte Carlo

The No-U-Turn Sampler (NUTS)

Our Implementation

Markov Chain

A Markov chain is a stochastic process \(\{X_t\}\) indexed by time \(t\geq 0\).

  • We call the set of all values that \(\{X_t\}\) can assume the state space of the Markov chain.
  • \(\{X_t\}\) is called a Markov chain if it fulfills the Markov property, \[ P(X_{t+1}=j\mid X_0=i_0,\dots,X_t=i_t)=P(X_{t+1}=j\mid X_t=i_t), \] i.e., the conditional distribution of the next state depends only on the current state.

Markov Chain Monte Carlo (MCMC) methods make use of Markov chains to sample from a target distribution.
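As a minimal sketch, simulating a Markov chain on a finite state space amounts to repeatedly sampling the next state from the row of a transition matrix; the three-state matrix below is a hypothetical example chosen for illustration.

```python
import numpy as np

# Hypothetical transition matrix for a three-state chain (states A, B, C);
# each row holds the transition probabilities out of one state and sums to 1.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.6, 0.0, 0.4],
    [0.7, 0.3, 0.0],
])

def simulate_chain(P, start, n_steps, rng):
    """Simulate the chain; by the Markov property, the distribution of the
    next state depends only on the current state (one row of P)."""
    states = np.empty(n_steps + 1, dtype=int)
    states[0] = start
    for t in range(n_steps):
        states[t + 1] = rng.choice(len(P), p=P[states[t]])
    return states

rng = np.random.default_rng(0)
chain = simulate_chain(P, start=0, n_steps=100_000, rng=rng)

# Long-run state frequencies approximate the stationary distribution pi,
# which solves pi = pi @ P.
freq = np.bincount(chain, minlength=3) / len(chain)
```

The empirical frequencies `freq` converge to the stationary distribution as the number of steps grows, which is exactly the property MCMC methods exploit.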

Markov Chain

[Figure: transition diagram of a three-state Markov chain with states A, B, C and transition probabilities 0.5, 0.3, 0.2, 0.6, 0.4, 0.7, 0.3.]

Assume that a Markov chain is

  • irreducible, i.e. every state can eventually be reached from any state;
  • aperiodic, i.e. returns to a state are not restricted to multiples of some fixed number of steps;
  • positive recurrent, i.e. for every state, the expected number of steps until the chain returns to that state is finite.

For such Markov chains, the transition probabilities converge to a unique stationary distribution on the state space.

Metropolis-Hastings

[Figure: transition diagram of a Metropolis–Hastings chain over states \(i\), \(j\), \(k\), with transition probabilities such as \(g(j\mid i)\,\alpha(i,j)\) and self-transition probabilities \(1-\sum_{h\neq i} g(h\mid i)\,\alpha(i,h)\).]

Metropolis-Hastings is a class of MCMC methods.

  • Given the current state \(i\), propose a new state \(j\sim g(\cdot\mid i)\) and compute \(\alpha(i,j)=\min\Bigl(1,\frac{f(j)\,g(i\mid j)}{f(i)\,g(j\mid i)}\Bigr)\), where \(f\) is the target density.
  • Draw \(U\sim\mathrm{Unif}(0,1)\); if \(U\le\alpha(i,j)\), move \(i\to j\); otherwise stay at \(i\).
  • The chain is reversible and has \(f\) as its stationary distribution.
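A minimal random-walk Metropolis sketch (a special case in which the symmetric proposal makes the \(g\) terms cancel), using a standard normal as a purely illustrative target:

```python
import numpy as np

def log_f(x):
    """Log of the (unnormalized) target density; a standard normal here."""
    return -0.5 * x**2

def random_walk_metropolis(log_f, x0, n_samples, step, rng):
    """Random-walk Metropolis: with a symmetric proposal, g(i|j) = g(j|i),
    so alpha(i, j) reduces to min(1, f(j) / f(i))."""
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        y = x + step * rng.normal()               # propose Y ~ g(. | x)
        if np.log(rng.uniform()) <= log_f(y) - log_f(x):
            x = y                                 # accept: move to the proposal
        samples[t] = x                            # on rejection, stay at x
    return samples

rng = np.random.default_rng(1)
samples = random_walk_metropolis(log_f, x0=0.0, n_samples=50_000, step=1.0, rng=rng)
```

Working with log densities avoids numerical underflow; only the ratio \(f(j)/f(i)\) matters, so the normalizing constant of \(f\) never needs to be known.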

 

 

Hamiltonian Monte Carlo

 

Origins in Physics

Imagine a frictionless puck sliding over an uneven surface. The state of this system is given by

  • \(\boldsymbol{q}\), the position of the puck;
  • \(\boldsymbol{p}\), the momentum of the puck, given by \(\mathrm{mass}\times\mathrm{velocity}\).
  • If the puck encounters a rising slope, momentum will decrease; if it encounters a falling slope, momentum will increase.
  • The higher the surface at a position, the higher the puck’s potential energy, and the lower its kinetic energy.

From Physics to Statistics

In non-physical applications,

  • the position \(\boldsymbol{q}\) corresponds to the variables of interest,
  • the potential energy \(U(\boldsymbol{q})\) corresponds to the negative log of the probability density for these variables, and
  • the momentum is an auxiliary variable that facilitates exploration of the target distribution space.

Hamiltonian Dynamics

The system is described by a so-called Hamiltonian function, \(H(\boldsymbol{q},\boldsymbol{p})\). Its partial derivatives, the equations of motion, determine how \(\boldsymbol{q}\) and \(\boldsymbol{p}\) change over time:

\[ \begin{aligned} \frac{\mathrm{d}q_i}{\mathrm{d}t}&=\frac{\partial H}{\partial p_i},\\ \frac{\mathrm{d}p_i}{\mathrm{d}t}&=-\frac{\partial H}{\partial q_i},\\ \end{aligned} \]

for \(i = 1, \dots, d\), where \(d\) is the dimension of the vectors \(\boldsymbol{q}\) and \(\boldsymbol{p}\).

Potential and Kinetic Energy

For Hamiltonian Monte Carlo, we usually use the following kind of Hamiltonian functions:

\[ H(\boldsymbol{q},\boldsymbol{p}) = \textcolor{var(--secondary-color)}{U(\boldsymbol{q})} + \textcolor{var(--tertiary-color)}{K(\boldsymbol{p})}, \]

where the potential energy \(U(\boldsymbol{q})\) is defined as minus the log probability density of the distribution for \(\boldsymbol{q}\) that we wish to sample, plus a constant;

and the kinetic energy \(K(\boldsymbol{p})\) is given by

\[ \textcolor{var(--tertiary-color)}{K(\boldsymbol{p})} = \boldsymbol{p}'\boldsymbol{M}^{-1}\boldsymbol{p}/2, \]

where \(\boldsymbol{M}\) is a symmetric, positive-definite, and often diagonal mass matrix.

Hamiltonian Dynamics

For Hamiltonian Monte Carlo, we usually use the following kind of Hamiltonian functions:

\[ H(\boldsymbol{q},\boldsymbol{p}) = \textcolor{var(--secondary-color)}{U(\boldsymbol{q})} + \textcolor{var(--tertiary-color)}{K(\boldsymbol{p})}, \]

Using this specification, the equations of motion can be written as:

\[ \begin{aligned} \frac{\mathrm{d}q_i}{\mathrm{d}t}&=[\boldsymbol{M}^{-1}\boldsymbol{p}]_i,\\ \frac{\mathrm{d}p_i}{\mathrm{d}t}&=-\frac{\partial U}{\partial q_i}.\\ \end{aligned} \]

Properties

Hamiltonian dynamics fulfill a set of properties that make them suitable for use in MCMC updating:

  • Reversibility. The mapping from the state at time \(t\) to the state at time \(t+s\) is one-to-one and hence has an inverse. This means that Markov chain transitions are reversible.
  • Conservation of the Hamiltonian. With dynamics as specified, the Hamiltonian \(H\) itself is kept invariant. For Metropolis updates, the acceptance probability is one if \(H\) is kept invariant, which is only approximately achievable in practice.
  • Volume preservation. Hamiltonian dynamics preserve volume in \((\boldsymbol{q},\boldsymbol{p})\) space, meaning that we do not need to account for a change in volume in the acceptance probability for Metropolis updates.

Reversibility and preservation of volume can be maintained even when the Hamiltonian is approximated.

Discretization

For implementation, we need to discretize the Hamiltonian equations using a small step size \(\varepsilon\): time is then discrete with \(t = 0, \varepsilon, 2\varepsilon, 3\varepsilon, \dots\)

The simplest way to approximate the solution is Euler’s method:

\[ p_i(t+\varepsilon) = p_i(t) - \varepsilon \frac{\partial U}{\partial q_i}(q(t)),\qquad q_i(t+\varepsilon) = q_i(t)+\varepsilon\frac{p_i(t)}{m_i} \]

We can obtain better results with a slight modification of Euler's method that uses the updated momentum in the position update:

\[ p_i(t+\varepsilon) = p_i(t) - \varepsilon \frac{\partial U}{\partial q_i}(q(t)),\qquad q_i(t+\varepsilon) = q_i(t)+\varepsilon\frac{\textcolor{var(--primary-color)}{p_i(t+\varepsilon)}}{m_i} \]

For even better results, we can use the Leapfrog method:

\[ \begin{aligned} &\textcolor{var(--secondary-color)}{p_i(t+\varepsilon/2)} = p_i(t) - \textcolor{var(--secondary-color)}{(\varepsilon/2)} \frac{\partial U}{\partial q_i}(q(t)),\qquad\qquad q_i(t+\varepsilon) = q_i(t)+\varepsilon\frac{\textcolor{var(--secondary-color)}{p_i(t+\varepsilon/2)}}{m_i},\\ &\textcolor{var(--secondary-color)}{p_i(t+\varepsilon) = p_i(t+\varepsilon/2) - (\varepsilon/2) \frac{\partial U}{\partial q_i}(q(t+\varepsilon))} \end{aligned} \]
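The leapfrog update translates directly into code. The sketch below assumes a diagonal mass matrix and, for demonstration, a hypothetical quadratic potential \(U(q)=q^2/2\), whose true dynamics are known in closed form:

```python
import numpy as np

def leapfrog(q, p, grad_U, epsilon, L, m):
    """L leapfrog steps for H(q, p) = U(q) + sum_i p_i^2 / (2 m_i)."""
    q, p = np.array(q, dtype=float), np.array(p, dtype=float)
    p -= 0.5 * epsilon * grad_U(q)          # initial half step for momentum
    for i in range(L):
        q += epsilon * p / m                # full step for position
        if i != L - 1:                      # interior momentum updates combine
            p -= epsilon * grad_U(q)        # two consecutive half steps
    p -= 0.5 * epsilon * grad_U(q)          # final half step for momentum
    return q, p

# U(q) = q^2 / 2, so grad_U(q) = q; with q(0) = 0, p(0) = 1 and unit mass,
# the exact dynamics are q(t) = sin(t), p(t) = cos(t).
grad_U = lambda q: q
q1, p1 = leapfrog(q=[0.0], p=[1.0], grad_U=grad_U, epsilon=0.1, L=25, m=np.array([1.0]))

# The Hamiltonian is approximately conserved along the trajectory.
H0 = 0.5 * 0.0**2 + 0.5 * 1.0**2
H1 = 0.5 * q1[0]**2 + 0.5 * p1[0]**2
```

After 25 steps of size 0.1 (total time 2.5), the endpoint stays close to the exact solution and the energy error is of order \(\varepsilon^2\).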

Discretization

In this example, \(H(q,p)=q^2/2+p^2/2\). The initial state was \(q=0\), \(p=1\). We can see that the leapfrog method preserves volume exactly.

Hamiltonian Monte Carlo

Using Hamiltonian dynamics to sample from a distribution requires translating the density to a potential energy function and introducing momentum variables. We can use the concept of a canonical distribution from statistical mechanics. Given an energy function \(E(x)\) for a state \(x\), the canonical distribution over states has density \(P(x)=(1/Z)\mathrm{exp}(-E(x)/T)\), where \(T\) is temperature and \(Z\) is a normalizing constant. Using the Hamiltonian \(H(\boldsymbol{q},\boldsymbol{p})=U(\boldsymbol{q})+K(\boldsymbol{p})\) as an energy function, we get

\[ P(\boldsymbol{q},\boldsymbol{p}) = \frac{1}{Z}\mathrm{exp}(-U(\boldsymbol{q})/T)\mathrm{exp}(-K(\boldsymbol{p})/T). \]

We can see that \(\boldsymbol{q}\) and \(\boldsymbol{p}\) are independent, and both have canonical distributions. The former will be used to represent our variables of interest, and the latter for momentum.

The Two Steps of HMC

Each iteration has two steps:

  1. In the first step, new values for the momentum variables are drawn from their Gaussian distribution.
  2. In the second step, a Metropolis update is performed using Hamiltonian dynamics as explained before. We get a proposed new state \((\boldsymbol{q}^*,\boldsymbol{p}^*)\). This state is accepted as the next state of the Markov chain with probability \(\mathrm{min}\Big(1,\quad \mathrm{exp}(-H(\boldsymbol{q}^*,\boldsymbol{p}^*)+H(\boldsymbol{q},\boldsymbol{p}))\Big)\quad =\quad \mathrm{min}\Big(1,\quad \mathrm{exp} \big(-U(\boldsymbol{q}^*)+U(\boldsymbol{q})-K(\boldsymbol{p}^*)+K(\boldsymbol{p})\big)\Big).\) If the proposed state is not accepted, the next state is set equal to the current state.

Only the first step, the resampling of the momentum, changes the probability density of the current state; the second step changes \(H\) (and hence the density) only through discretization error. Both steps leave the canonical distribution invariant.
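The two steps can be sketched as follows, assuming a unit mass matrix and using a standard bivariate normal as a purely illustrative target:

```python
import numpy as np

def hmc_step(q, log_density, grad_log_density, epsilon, L, rng):
    """One HMC iteration with U(q) = -log_density(q) and K(p) = p'p / 2."""
    # Step 1: resample the momentum from its Gaussian distribution.
    p = rng.normal(size=q.shape)
    # Step 2a: propose (q*, p*) by leapfrog simulation of the dynamics;
    # note that grad_U(q) = -grad_log_density(q), hence the plus signs.
    q_new, p_new = np.copy(q), np.copy(p)
    p_new += 0.5 * epsilon * grad_log_density(q_new)
    for i in range(L):
        q_new += epsilon * p_new
        if i != L - 1:
            p_new += epsilon * grad_log_density(q_new)
    p_new += 0.5 * epsilon * grad_log_density(q_new)
    # Step 2b: Metropolis accept/reject with probability min(1, exp(H - H*)).
    H = -log_density(q) + 0.5 * p @ p
    H_new = -log_density(q_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) <= H - H_new:
        return q_new                 # accept the proposed state
    return q                         # reject: next state = current state

# Illustrative target: a standard bivariate normal.
log_density = lambda q: -0.5 * q @ q
grad_log_density = lambda q: -q

rng = np.random.default_rng(2)
q = np.zeros(2)
draws = np.empty((5_000, 2))
for t in range(len(draws)):
    q = hmc_step(q, log_density, grad_log_density, epsilon=0.2, L=10, rng=rng)
    draws[t] = q
```

Because the leapfrog integrator nearly conserves \(H\), the acceptance probability stays close to one even for distant proposals.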

Benefits of HMC

HMC can only be used to sample from continuous distributions on \(\mathbb{R}^d\) for which the density function can be evaluated (up to a normalizing constant) and the partial derivatives of its log can be computed.

It allows us to sample more efficiently from such distributions than simpler methods such as random-walk Metropolis.

Tuning HMC

However, in order to actually attain the benefits of HMC, we have to properly tune the step size \(\varepsilon\) and the trajectory length \(L\), i.e., the number of leapfrog steps. Tuning an HMC algorithm is also more difficult than tuning a simple Metropolis method.

  • If the step size \(\varepsilon\) is too large, acceptance rates will be low.
  • If the step size \(\varepsilon\) is too small, we will waste computation time.
  • If the trajectory length \(L\) is too large, trajectories may make a U-turn and loop back toward their starting point, wasting computation.
  • If the trajectory length \(L\) is too small, it may not reach far enough into unconstrained directions. In connection with a small step size \(\varepsilon\), it may lead to slow exploration by a random walk.
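The step-size trade-off above can be illustrated numerically with the harmonic-oscillator Hamiltonian \(H(q,p)=q^2/2+p^2/2\): the leapfrog energy error \(|\Delta H|\), which drives the rejection rate, stays tiny for a small \(\varepsilon\) but explodes once \(\varepsilon\) exceeds the method's stability limit (which is \(\varepsilon=2\) for this \(H\)).

```python
def leapfrog_energy_error(epsilon, L, q=0.0, p=1.0):
    """|Delta H| after L leapfrog steps for H(q, p) = q^2/2 + p^2/2,
    where grad_U(q) = q."""
    H0 = 0.5 * q**2 + 0.5 * p**2
    p -= 0.5 * epsilon * q              # initial half step for momentum
    for i in range(L):
        q += epsilon * p                # full step for position
        if i != L - 1:
            p -= epsilon * q            # full step for momentum
    p -= 0.5 * epsilon * q              # final half step for momentum
    return abs(0.5 * q**2 + 0.5 * p**2 - H0)

# A small step size tracks the true dynamics closely, so the acceptance
# probability exp(-Delta H) stays near one ...
small = leapfrog_energy_error(epsilon=0.1, L=50)
# ... while a step size beyond the stability limit makes the discretized
# trajectory diverge, so virtually every proposal would be rejected.
large = leapfrog_energy_error(epsilon=2.1, L=50)
```

The sharp transition at the stability limit is one reason step-size tuning matters so much in practice: a too-large \(\varepsilon\) does not merely degrade performance gradually but can make the sampler stall entirely.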

Illustration

 

The No-U-Turn Sampler (NUTS)

 

 

Why NUTS?

Setting L

Eliminating the Need to Set L

NUTS

Adaptively Tuning ε

Dual-Averaging NUTS

Illustration

Our Implementation

 

 

 


References


Feng, C. (2017). The Markov-Chain Monte Carlo Interactive Gallery (Version c4723db). https://github.com/chi-feng/mcmc-demo.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.